Recommendations with IBM
In this notebook, you will be putting your recommendation skills to use on real data from the IBM Watson Studio platform.
You may either submit your notebook through the workspace here, or you may work from your local machine and submit through the next page. Either way, ensure that your code passes the project RUBRIC. Please save regularly.
By following the table of contents, you will build out a number of different methods for making recommendations that can be used for different situations.
Table of Contents
I. Exploratory Data Analysis
II. Rank Based Recommendations
III. User-User Based Collaborative Filtering
IV. Content Based Recommendations (EXTRA - NOT REQUIRED)
V. Matrix Factorization
VI. Extras & Concluding
At the end of the notebook, you will find directions for how to submit your work. Let's get started by importing the necessary libraries and reading in the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tests.project_tests as t
import pickle
import os
import plotly.graph_objects as po
from typing import List
df = pd.read_csv(f"{os.getcwd().replace('notebooks', 'data')}/user-item-interactions.csv")
df_content = pd.read_csv(f"{os.getcwd().replace('notebooks', 'data')}/articles_community.csv")
del df["Unnamed: 0"]
del df_content["Unnamed: 0"]
# Show df to get an idea of the data
df.head()
| | article_id | title | email |
|---|---|---|---|
| 0 | 1430.0 | using pixiedust for fast, flexible, and easier... | ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7 |
| 1 | 1314.0 | healthcare python streaming application demo | 083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b |
| 2 | 1429.0 | use deep learning for image classification | b96a4f2e92d8572034b1e9b28f9ac673765cd074 |
| 3 | 1338.0 | ml optimization using cognitive assistant | 06485706b34a5c9bf2a0ecdac41daf7e7654ceb7 |
| 4 | 1276.0 | deploy your python model as a restful api | f01220c46fc92c6e6b161b1849de11faacd7ccb2 |
# Show df_content to get an idea of the data
df_content.head()
| | doc_body | doc_description | doc_full_name | doc_status | article_id |
|---|---|---|---|---|---|
| 0 | Skip navigation Sign in SearchLoading...\r\n\r... | Detect bad readings in real time using Python ... | Detect Malfunctioning IoT Sensors with Streami... | Live | 0 |
| 1 | No Free Hunch Navigation * kaggle.com\r\n\r\n ... | See the forest, see the trees. Here lies the c... | Communicating data science: A guide to present... | Live | 1 |
| 2 | ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... | Here’s this week’s news in Data Science and Bi... | This Week in Data Science (April 18, 2017) | Live | 2 |
| 3 | DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... | Learn how distributed DBs solve the problem of... | DataLayer Conference: Boost the performance of... | Live | 3 |
| 4 | Skip navigation Sign in SearchLoading...\r\n\r... | This video demonstrates the power of IBM DataS... | Analyze NY Restaurant data using Spark in DSX | Live | 4 |
Part I: Exploratory Data Analysis
Use the dictionary and cells below to provide some insight into the descriptive statistics of the data.
1. What is the distribution of how many articles a user interacts with in the dataset? Provide a visual and descriptive statistics that show how many times each user interacts with an article.
user_interactions = df.groupby(by=["email"]).size().sort_values(ascending=False)
# user_interactions = df["email"].value_counts()
display(user_interactions)
median_val = user_interactions.median()
max_views_by_user = user_interactions.max()
email
2b6c0f514c2f2b04ad3c4583407dccd0810469ee 364
77959baaa9895a7e2bdc9297f8b27c1b6f2cb52a 363
2f5c7feae533ce046f2cb16fb3a29fe00528ed66 170
a37adec71b667b297ed2440a9ff7dad427c7ac85 169
8510a5010a5d4c89f5b07baac6de80cd12cfaf93 160
...
9655144418d25a0e074616840447e6e5dbef0069 1
9656a8f1059d7af6be6ddaec889c66bc9d402b77 1
96654c6e066d002e5b44f6e9e38217c10c81f698 1
966ca71b9b2ea0dc5c0cb0cd9f523cdc7ad2f0cc 1
9678e0a3f95203d23df78f8d733d22eae4a07b0c 1
Length: 5148, dtype: int64
fig = po.Figure()
fig.add_trace(po.Histogram(x=user_interactions))
fig.update_layout(
title_text="Distribution of user-article interactions",
xaxis_title="Number of articles interacted with",
yaxis_title="Number of users",
template="plotly_dark"
)
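The descriptive statistics asked for above can also be read off in one call. The following is a minimal sketch on an invented toy log (the `toy` frame and its values are assumptions, not the IBM data); it counts interactions per user and summarizes the counts with `describe()`.

```python
import pandas as pd

# Toy interaction log: one row per (user, article) view. Values are invented.
toy = pd.DataFrame({
    "email": ["a", "a", "a", "b", "b", "c", "d"],
    "article_id": [1.0, 2.0, 2.0, 1.0, 3.0, 1.0, 4.0],
})

# Number of interactions per user (repeat views of one article count each time).
per_user = toy["email"].value_counts()

# count, mean, std, min, quartiles (50% is the median) and max in one table.
summary = per_user.describe()
print(summary[["50%", "max"]])
```

On the real data, the same two lines applied to `df["email"]` would give the median and maximum stored in `median_val` and `max_views_by_user`.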
2. Explore and remove duplicate articles from the df_content dataframe.
df_content_without_duplicates = df_content.drop_duplicates(subset="article_id", keep="first")
display(len(df_content))
display(len(df_content_without_duplicates))
1056
1051
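Before dropping duplicates it can help to look at them. This is a small sketch on a made-up frame (`toy_content` is hypothetical); `duplicated(keep=False)` flags every member of a duplicate group so the rows can be compared side by side, and `drop_duplicates(keep="first")` then keeps one row per `article_id`.

```python
import pandas as pd

# Invented stand-in for df_content with two duplicated article ids.
toy_content = pd.DataFrame({
    "article_id": [0, 1, 1, 2, 2],
    "doc_full_name": ["A", "B", "B (copy)", "C", "C"],
})

# keep=False marks all rows that share an article_id with another row.
dupes = toy_content[toy_content.duplicated(subset="article_id", keep=False)]
print(dupes)

# Keep only the first occurrence of each article_id.
deduped = toy_content.drop_duplicates(subset="article_id", keep="first")
print(len(toy_content), len(deduped))
```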
3. Use the cells below to find:
a. The number of unique articles that have an interaction with a user.
b. The number of unique articles in the dataset (whether they have any interactions or not).
c. The number of unique users in the dataset. (excluding null values)
d. The number of user-article interactions in the dataset.
display(df_content["article_id"].unique().shape)
display(df["article_id"].unique().shape)
display(df["email"].nunique())
(1051,)
(714,)
5148
unique_articles = df["article_id"].unique().shape[0]
total_articles = df_content["article_id"].unique().shape[0]
unique_users = df["email"].nunique()
user_article_interactions = len(df)
4. Use the cells below to find the most viewed article_id, as well as how often it was viewed. After talking to the company leaders, the email_mapper function was deemed a reasonable way to map users to ids. There were a small number of null values, and it was found that all of these null values likely belonged to a single user (which is how they are stored using the function below).
article_interactions = df.groupby(by=["article_id"]).size().sort_values(ascending=False)
display(article_interactions)
display(article_interactions[article_interactions.idxmax()])
article_id
1429.0 937
1330.0 927
1431.0 671
1427.0 643
1364.0 627
...
984.0 1
1344.0 1
675.0 1
662.0 1
653.0 1
Length: 714, dtype: int64
np.int64(937)
most_viewed_article_id = str(article_interactions.idxmax())
max_views = article_interactions[article_interactions.idxmax()]
## No need to change the code here - this will be helpful for later parts of the notebook
# Run this cell to map the user email to a user_id column and remove the email column
def email_mapper():
    coded_dict = dict()
    cter = 1
    email_encoded = []
    for val in df["email"]:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter += 1
        email_encoded.append(coded_dict[val])
    return email_encoded
email_encoded = email_mapper()
del df["email"]
df["user_id"] = email_encoded
# show header
df.head()
| | article_id | title | user_id |
|---|---|---|---|
| 0 | 1430.0 | using pixiedust for fast, flexible, and easier... | 1 |
| 1 | 1314.0 | healthcare python streaming application demo | 2 |
| 2 | 1429.0 | use deep learning for image classification | 3 |
| 3 | 1338.0 | ml optimization using cognitive assistant | 4 |
| 4 | 1276.0 | deploy your python model as a restful api | 5 |
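To make the treatment of null emails explicit, here is a hedged variant of the mapping idea. It is not the notebook's `email_mapper`; the function name `encode_emails` and the `"__missing__"` sentinel are assumptions introduced for illustration. Every missing value is sent to one shared key, so all nulls receive the same user id.

```python
import pandas as pd

def encode_emails(emails):
    """Map each distinct email to an integer id in order of first appearance.

    All missing values (None or NaN) share one id via a sentinel key.
    """
    ids = {}
    encoded = []
    for email in emails:
        key = "__missing__" if pd.isna(email) else email
        if key not in ids:
            ids[key] = len(ids) + 1
        encoded.append(ids[key])
    return encoded

# Invented sample: two real users and two null entries.
toy_emails = pd.Series(["x@a", "y@b", None, "x@a", float("nan")])
encoded = encode_emails(toy_emails)
print(encoded)
```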
## If you stored all your results in the variable names above,
## you shouldn't need to change anything in this cell
sol_1_dict = {
"`50% of individuals have _____ or fewer interactions.`": median_val,
"`The total number of user-article interactions in the dataset is ______.`": user_article_interactions,
"`The maximum number of user-article interactions by any 1 user is ______.`": max_views_by_user,
"`The most viewed article in the dataset was viewed _____ times.`": max_views,
"`The article_id of the most viewed article is ______.`": most_viewed_article_id,
"`The number of unique articles that have at least 1 rating ______.`": unique_articles,
"`The number of unique users in the dataset is ______`": unique_users,
"`The number of unique articles on the IBM platform`": total_articles,
}
# Test your dictionary against the solution
t.sol_1_test(sol_1_dict)
It looks like you have everything right here! Nice job!
Part II: Rank-Based Recommendations
Unlike in the earlier lessons, we don't actually have ratings for whether a user liked an article or not. We only know that a user has interacted with an article. In these cases, the popularity of an article can really only be based on how often an article was interacted with.
1. Fill in the functions below to return the top n articles, ordered from most to fewest interactions. Test your functions using the tests below.
def get_top_article_ids(n, df=df):
    """
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    top_articles_ids - (list) A list of the top 'n' article ids
    """
    top_articles_ids = df.groupby(by=["article_id"]).size().sort_values(ascending=False)
    return top_articles_ids.iloc[:n].index.to_list()
def get_top_articles(n, df=df):
    """
    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles
    """
    article_id_title_mapping = df[["article_id", "title"]].drop_duplicates().set_index("article_id").to_dict()["title"]
    top_articles_ids = get_top_article_ids(n=n, df=df)
    top_articles = list(map(lambda id: article_id_title_mapping.get(id), top_articles_ids))
    return top_articles
print(get_top_articles(10))
print(get_top_article_ids(10))
['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']
[1429.0, 1330.0, 1431.0, 1427.0, 1364.0, 1314.0, 1293.0, 1170.0, 1162.0, 1304.0]
# Test your function by returning the top 5, 10, and 20 articles
top_5 = get_top_articles(5)
top_10 = get_top_articles(10)
top_20 = get_top_articles(20)
# Test each of your three lists from above
t.sol_2_test(get_top_articles)
Your top_5 looks like the solution list! Nice job.
Your top_10 looks like the solution list! Nice job.
Your top_20 looks like the solution list! Nice job.
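The rank-based idea of this part, popularity measured purely by interaction counts, can be sketched on a tiny invented log. The frame `toy` and the helper `top_titles` are hypothetical names; the notebook's own functions group by `article_id` instead, which is equivalent when each id has one title.

```python
import pandas as pd

# Invented interaction log: each row is one view of an article.
toy = pd.DataFrame({
    "article_id": [10.0, 10.0, 10.0, 20.0, 20.0, 30.0],
    "title": ["intro to spark"] * 3 + ["pandas tips"] * 2 + ["r basics"],
})

def top_titles(n, interactions):
    """Return the n most frequently viewed titles, most viewed first."""
    counts = interactions["title"].value_counts()  # sorted descending by count
    return counts.head(n).index.tolist()

print(top_titles(2, toy))
```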
Part III: User-User Based Collaborative Filtering
1. Use the function below to reformat the df dataframe to be shaped with users as the rows and articles as the columns.
Each user should only appear in each row once.
Each article should only show up in one column.
If a user has interacted with an article, then place a 1 where the user-row meets for that article-column. It does not matter how many times a user has interacted with the article, all entries where a user has interacted with an article should be a 1.
If a user has not interacted with an item, then place a zero where the user-row meets for that article-column.
Use the tests to make sure the basic structure of your matrix matches what is expected by the solution.
# create the user-article matrix with 1's and 0's
def create_user_item_matrix(df):
    """
    INPUT:
    df - pandas dataframe with article_id, title, user_id columns
    OUTPUT:
    user_item - user item matrix
    Description:
    Return a matrix with user ids as rows and article ids on the columns with 1 values where a user interacted with
    an article and a 0 otherwise
    """
    user_item = df.pivot_table(index="user_id", columns="article_id", aggfunc="size", fill_value=0)
    user_item[user_item > 0] = 1
    return user_item
user_item = create_user_item_matrix(df)
display(user_item)
| article_id | 0.0 | 2.0 | 4.0 | 8.0 | 9.0 | 12.0 | 14.0 | 15.0 | 16.0 | 18.0 | ... | 1434.0 | 1435.0 | 1436.0 | 1437.0 | 1439.0 | 1440.0 | 1441.0 | 1442.0 | 1443.0 | 1444.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||||||||
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5145 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5146 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5147 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5148 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5149 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5149 rows × 714 columns
## Tests: You should just need to run this cell. Don't change the code.
assert (
user_item.shape[0] == 5149
), "Oops! The number of users in the user-article matrix doesn't look right."
assert (
user_item.shape[1] == 714
), "Oops! The number of articles in the user-article matrix doesn't look right."
assert (
user_item.sum(axis=1)[1] == 36
), "Oops! The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests! Please proceed!")
You have passed our quick tests! Please proceed!
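As an alternative way to see the structure being tested here, the binary user-article matrix can be built with `pd.crosstab` and capped at 1. This is a sketch on invented data (`toy` and `toy_matrix` are assumed names), not a replacement for `create_user_item_matrix`.

```python
import pandas as pd

# Invented log: user 1 views article 10.0 twice, which must still yield a single 1.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "article_id": [10.0, 10.0, 20.0, 20.0, 10.0, 30.0],
})

# crosstab counts interactions per (user, article); clip caps every count at 1.
toy_matrix = pd.crosstab(toy["user_id"], toy["article_id"]).clip(upper=1)
print(toy_matrix)
```

Each user appears in exactly one row and each article in exactly one column, and row sums give the number of distinct articles a user has seen.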
2. Complete the function below which should take a user_id and provide an ordered list of the most similar users to that user (from most similar to least similar). The returned result should not contain the provided user_id, as we know that each user is similar to him/herself. Because the results for each user here are binary, it (perhaps) makes sense to compute similarity as the dot product of two users.
Use the tests to test your function.
def find_similar_users(user_id: int, user_item: pd.DataFrame = user_item):
    """
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles:
    1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
    are listed first
    Description:
    Computes the similarity of the given user to every user based on the dot product
    Returns an ordered list of the other user ids, most similar first
    """
    # Dot product of every user's row with the target user's row
    similarity_scores = user_item.dot(user_item.loc[user_id])
    sorted_scores = similarity_scores.sort_values(ascending=False)
    similar_users_ids = sorted_scores.index.to_list()
    # A user is always most similar to themselves, so drop the input user
    similar_users_ids.remove(user_id)
    return similar_users_ids
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print(
"The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5])
)
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))
The 10 most similar users to user 1 are: [3933, 23, 3782, 203, 4459, 3870, 131, 46, 4201, 395]
The 5 most similar users to user 3933 are: [1, 23, 3782, 203, 4459]
The 3 most similar users to user 46 are: [4201, 3782, 23]
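Why the dot product works as a similarity for binary rows: each product term is 1 only when both users have seen the same article, so the sum counts shared articles. A minimal NumPy sketch on an invented 4-user matrix (all names and values here are assumptions):

```python
import numpy as np

# Rows are users, columns are articles; 1 means the user has seen the article.
m = np.array([
    [1, 1, 0, 1],  # user 0
    [1, 1, 0, 0],  # user 1
    [0, 0, 1, 0],  # user 2
    [1, 0, 0, 1],  # user 3
])

target = 0
# Number of articles each user shares with the target user.
scores = m @ m[target]

# A stable sort on the negated scores ranks high scores first and
# keeps the lower user index first among ties.
order = np.argsort(-scores, kind="stable")
neighbors = [int(u) for u in order if u != target]
print(scores.tolist(), neighbors)
```

Users 1 and 3 tie with two shared articles each, which is the kind of tie the later `get_top_sorted_users` step is meant to break in a principled way.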
3. Now that you have a function that provides the most similar users to each user, you will want to use these users to find articles you can recommend. Complete the functions below to return the articles you would recommend to each user.
def get_article_names(article_ids: list, df: pd.DataFrame = df):
    """
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids
    (this is identified by the title column)
    """
    article_ids = [str(id) for id in article_ids]
    # NOTE: this casts article_id to str on the dataframe that is passed in
    # (by default the shared df), so later cells see string ids such as '1429.0'
    df['article_id'] = df['article_id'].astype(str)
    article_id_title_mapping: dict = df[["article_id", "title"]].drop_duplicates().set_index("article_id").to_dict()["title"]
    article_names = list(map(lambda id: article_id_title_mapping.get(id), article_ids))
    return article_names
def get_user_articles(user_id: int, user_item: pd.DataFrame = user_item):
    """
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles:
    1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user, as strings
    article_names - (list) a list of article names associated with the list of article ids
    (this is identified by the title column in df)
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    """
    article_ids = user_item.loc[user_id][user_item.loc[user_id] == 1].index.to_list()
    article_names = get_article_names(article_ids=article_ids)
    return [str(id) for id in article_ids], article_names
def user_user_recs(user_id: int, m: int = 10):
    """
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    OUTPUT:
    recs - (list) a list of recommendations for the user
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    For the user where the number of recommended articles starts below m
    and ends exceeding m, the last items are chosen arbitrarily
    """
    # Get a set of already seen articles
    seen_articles_ids = set(get_user_articles(user_id)[0])
    # Get other users sorted by similarity
    similar_users = find_similar_users(user_id=user_id)
    recs = []
    for user in similar_users:
        if len(recs) >= m:
            break
        # Get articles seen by the similar user
        similar_user_article_ids, _ = get_user_articles(user)
        # Filter out articles the current user has already seen
        # and articles that are already in the recommendations
        new_articles = list(set(similar_user_article_ids) - seen_articles_ids - set(recs))
        # Add the new articles to the recommendations list
        recs.extend(new_articles)
    # Return only the first m recommendations
    return recs[:m]
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1
['data visualization playbook: telling the data story', 'learn basics about notebooks and apache spark', 'apache spark lab, part 3: machine learning', 'analyze open data sets with pandas dataframes', 'perform sentiment analysis with lstms, using tensorflow', 'machine learning and the science of choosing', 'ml optimization using cognitive assistant', '54174 detect potentially malfunctioning sensors in r...\nName: title, dtype: object', 'dsx: hybrid mode', 'using rstudio in ibm data science experience']
# Test your functions here - No need to change this code - just run this cell
assert set(
get_article_names(["1024.0", "1176.0", "1305.0", "1314.0", "1422.0", "1427.0"])
) == set(
[
"using deep learning to reconstruct high-resolution audio",
"build a python app on the streaming analytics service",
"gosales transactions for naive bayes model",
"healthcare python streaming application demo",
"use r dataframes & ibm watson natural language understanding",
"use xgboost, scikit-learn & ibm watson machine learning apis",
]
), "Oops! Your get_article_names function doesn't work quite how we expect."
assert set(get_article_names(["1320.0", "232.0", "844.0"])) == set(
[
"housing (2015): united states demographic measures",
"self-service data preparation with ibm data refinery",
"use the cloudant-spark connector in python notebook",
]
), "Oops! Your get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set(["1320.0", "232.0", "844.0"])
assert set(get_user_articles(20)[1]) == set(
[
"housing (2015): united states demographic measures",
"self-service data preparation with ibm data refinery",
"use the cloudant-spark connector in python notebook",
]
)
assert set(get_user_articles(2)[0]) == set(
["1024.0", "1176.0", "1305.0", "1314.0", "1422.0", "1427.0"]
)
assert set(get_user_articles(2)[1]) == set(
[
"using deep learning to reconstruct high-resolution audio",
"build a python app on the streaming analytics service",
"gosales transactions for naive bayes model",
"healthcare python streaming application demo",
"use r dataframes & ibm watson natural language understanding",
"use xgboost, scikit-learn & ibm watson machine learning apis",
]
)
print("If this is all you see, you passed all of our tests! Nice job!")
If this is all you see, you passed all of our tests! Nice job!
4. Now we are going to improve the consistency of the user_user_recs function from above.
Instead of choosing arbitrarily among users who are equally close to a given user, choose the users with the most total article interactions before those with fewer.
Instead of choosing arbitrarily among the articles of the user where the number of recommended articles starts below m and ends exceeding m, choose the articles with the most total interactions before those with fewer. This ranking should match what the get_top_articles function you wrote earlier would produce.
def get_top_sorted_users(user_id: int, df: pd.DataFrame = df, user_item: pd.DataFrame = user_item):
    """
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook
    user_item - (pandas dataframe) matrix of users by articles:
    1's when a user has interacted with an article, 0 otherwise
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
    neighbor_id - is a neighbor user_id
    similarity - measure of the similarity of each user to the provided user_id
    num_interactions - the number of articles viewed by the user - if a u
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where
    highest of each is higher in the dataframe
    """
    # Get other users similarity scores
    similarity_scores = user_item.dot(user_item.loc[user_id])
    neighbors_df = pd.DataFrame({
        'neighbor_id': similarity_scores.index,
        'similarity': similarity_scores.values,
        # Row sums of the binary matrix: number of distinct articles each user has seen
        'num_interactions': user_item.sum(axis=1)
    })
    # Sort by similarity and total interactions (respectively)
    neighbors_df = neighbors_df.sort_values(by=['similarity', 'num_interactions'], ascending=False)
    # Remove the user itself from the results
    neighbors_df = neighbors_df[neighbors_df['neighbor_id'] != user_id]
    return neighbors_df
def user_user_recs_part2(user_id: int, m: int = 10):
    """
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    Notes:
    * Choose the users that have the most total article interactions
    before choosing those with fewer article interactions.
    * Choose the articles with the most total interactions
    before choosing those with fewer total interactions.
    """
    recs = []
    rec_names = []
    # Get top sorted neighbors
    neighbors_df = get_top_sorted_users(user_id)
    # Get a set of already seen articles
    seen_articles = set(get_user_articles(user_id)[0])
    for _, row in neighbors_df.iterrows():
        # Stop visiting neighbors once we have m recommendations
        if len(recs) >= m:
            break
        # Get articles seen by the neighbor
        neighbor_articles = set(get_user_articles(row['neighbor_id'])[0])
        # Filter out articles the current user has already seen
        # and articles that were already recommended via an earlier neighbor
        new_articles = neighbor_articles - seen_articles - {str(rec) for rec in recs}
        # Sort articles based on total interactions (using sum across all users)
        article_interaction_counts = user_item[[float(article_id) for article_id in new_articles]].sum(axis=0)
        sorted_new_articles = article_interaction_counts.sort_values(ascending=False).index
        # Add articles to recommendations list until we have m
        for article_id in sorted_new_articles:
            if len(recs) < m:
                recs.append(article_id)
            else:
                break
    # Get names of the recommended articles
    rec_names = get_article_names(recs)
    return recs, rec_names
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)
The top 10 recommendations for user 20 are the following article ids:
[1330.0, 1429.0, 1314.0, 1271.0, 43.0, 1351.0, 1336.0, 1368.0, 151.0, 1338.0]
The top 10 recommendations for user 20 are the following article names:
['insights from new york car accident reports', 'use deep learning for image classification', 'healthcare python streaming application demo', 'customer demographics and sales', 'deep learning with tensorflow course by big data university', 'model bike sharing data with spss', 'learn basics about notebooks and apache spark', 'putting a human face on machine learning', 'jupyter notebook tutorial', 'ml optimization using cognitive assistant']
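The tie-breaking rule for neighbors (similarity first, then number of interactions) comes down to a two-key descending sort. A small sketch with invented neighbor values (the numbers are assumptions, not results from the IBM data):

```python
import pandas as pd

# Invented neighbors: ids 3 and 5 tie on similarity, as do ids 2 and 4.
neighbors = pd.DataFrame({
    "neighbor_id": [2, 3, 4, 5],
    "similarity": [5, 7, 5, 7],
    "num_interactions": [40, 12, 90, 30],
})

# Sort by similarity, then break ties by the number of interactions.
ranked = neighbors.sort_values(
    by=["similarity", "num_interactions"], ascending=False
)
ranked_ids = ranked["neighbor_id"].tolist()
print(ranked_ids)
```

Within each similarity level, the more active user is visited first, so their unseen articles reach the recommendation list earlier.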
5. Use your functions from above to correctly fill in the solutions to the dictionary below. Then test your dictionary against the solution. Provide the code you need to answer each following the comments below.
### Tests with a dictionary of results
user1_most_sim = find_similar_users(user_id=1)[0]
user131_10th_sim = find_similar_users(user_id=131)[9]
## Dictionary Test Here
sol_5_dict = {
"The user that is most similar to user 1.": user1_most_sim,
"The user that is the 10th most similar to user 131": user131_10th_sim,
}
t.sol_5_test(sol_5_dict)
This all looks good! Nice job!
6. If we were given a new user, which of the above functions would you be able to use to make recommendations? Explain. Can you think of a better way we might make recommendations? Use the cell below to explain a better method for new users.
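None of the collaborative filtering functions above (find_similar_users, user_user_recs, user_user_recs_part2) can be used for a new user: with no interaction history, the user has no row in user_item, so there is nothing to compute a dot-product similarity from. The rank-based functions get_top_articles and get_top_article_ids still work, because they depend only on overall popularity. A better method, once we know something about the user's interests, is content-based filtering: recommend the most popular articles within the categories or tags the user is likely to care about. In our simplified case, the realistic option is to recommend the most popular articles across all users.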
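The cold-start fallback can be sketched as a small dispatcher. Everything here is hypothetical: the function `recommend`, the toy matrix, and the simplified known-user branch (unseen articles by popularity), which only stands in for the collaborative filtering functions of this notebook.

```python
import pandas as pd

def recommend(user_id, user_item, m=3):
    """Return up to m article ids; fall back to popularity for unknown users."""
    # Popularity = number of users who interacted with each article.
    popularity = user_item.sum(axis=0).sort_values(ascending=False)
    if user_id not in user_item.index:
        # Cold start: no history, so recommend the most popular articles.
        return popularity.head(m).index.tolist()
    # Known user (placeholder logic): most popular articles not yet seen.
    seen = user_item.loc[user_id]
    unseen = popularity[seen[popularity.index] == 0]
    return unseen.head(m).index.tolist()

# Invented binary user-article matrix.
toy_matrix = pd.DataFrame(
    [[1, 1, 0], [1, 0, 0], [1, 1, 1]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)
print(recommend(99, toy_matrix, m=2))  # user 99 has no row in the matrix
print(recommend(2, toy_matrix, m=2))
```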
7. Using your existing functions, provide the top 10 recommended articles you would provide for a new user below. You can test your function against our thoughts to make sure we are all on the same page with how we might make a recommendation.
new_user = '0.0'
# What would your recommendations be for this new user '0.0'? As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to
new_user_recs = get_top_article_ids(10)
assert set(new_user_recs) == set(
[
"1314.0",
"1429.0",
"1293.0",
"1427.0",
"1162.0",
"1364.0",
"1304.0",
"1170.0",
"1431.0",
"1330.0",
]
), "Oops! It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."
print("That's right! Nice job!")
That's right! Nice job!
Part IV: Content Based Recommendations (EXTRA - NOT REQUIRED)
Another method we might use to make recommendations is to perform a ranking of the highest ranked articles associated with some term. You might consider content to be the doc_body, doc_description, or doc_full_name. There isn't one way to create a content based recommendation, especially considering that each of these columns holds content related information.
1. Use the function body below to create a content based recommender. Since there isn't one right answer for this recommendation tactic, no test functions are provided. Feel free to change the function inputs if you decide you want to try a method that requires more input values. The input values are currently set with one idea in mind that you may use to make content based recommendations. One additional idea is that you might want to choose the most popular recommendations that meet your 'content criteria', but again, there is a lot of flexibility in how you might make these recommendations.
This part is NOT REQUIRED to pass this project. However, you may choose to take this on as an extra way to show off your skills.
from sklearn.feature_extraction.text import TfidfVectorizer
article_id=1027
top_n = 10
# Combine all text features into one combined text
df_content['combined_text'] = (
df_content['doc_body'].fillna('') + " " +
df_content['doc_description'].fillna('') + " " +
df_content['doc_full_name'].fillna('')
)
# Initialize and fit-transform the combined text
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf_vectorizer.fit_transform(df_content['combined_text'])
#display(tfidf_vectorizer.vocabulary_)
cosine_sim = tfidf_matrix.toarray()@np.transpose(tfidf_matrix.toarray())
cosine_sim
array([[1. , 0.04841737, 0.12630925, ..., 0.04322074, 0.06067107,
0.12105066],
[0.04841737, 1. , 0.13440966, ..., 0.08038855, 0.02551396,
0.14014219],
[0.12630925, 0.13440966, 1. , ..., 0.17323255, 0.06897655,
0.25874571],
...,
[0.04322074, 0.08038855, 0.17323255, ..., 1. , 0.01875537,
0.10421227],
[0.06067107, 0.02551396, 0.06897655, ..., 0.01875537, 1. ,
0.07044439],
[0.12105066, 0.14014219, 0.25874571, ..., 0.10421227, 0.07044439,
1. ]], shape=(1056, 1056))
article_idx = df_content.index[df_content['article_id'] == article_id].tolist()[0]
similarity_scores = list(enumerate(cosine_sim[article_idx].flatten()))
similarity_scores = [(idx, score) for idx, score in similarity_scores if idx != article_idx]
similarity_scores_sorted = sorted(similarity_scores, key=lambda score_tuple: score_tuple[1], reverse=True)
[df_content.iloc[article_score[0]]['article_id'] for article_score in similarity_scores_sorted[:top_n]]
[np.int64(627), np.int64(439), np.int64(220), np.int64(623), np.int64(279), np.int64(457), np.int64(989), np.int64(709), np.int64(983), np.int64(327)]
def make_content_recs(article_id: float, df_content: pd.DataFrame = df_content, top_n: int = 10):
    """
    INPUT:
    article_id - (int) The article ID for which recommendations are required
    df_content - (pd.DataFrame) The DataFrame containing article content
    top_n - (int) The number of recommended articles to return
    OUTPUT:
    recommendations - (list) Titles (doc_full_name) of the articles most similar to the input article_id
    """
    # Combine all text features into one combined text
    # (this adds a combined_text column to the dataframe that is passed in)
    df_content['combined_text'] = (
        df_content['doc_body'].fillna('') + " " +
        df_content['doc_description'].fillna('') + " " +
        df_content['doc_full_name'].fillna('')
    )
    # Initialize and fit-transform the combined text using bag of words and tf-idf transformer
    # Limit features to 5000
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
    tfidf_matrix = tfidf_vectorizer.fit_transform(df_content['combined_text'])
    # Compute cosine similarity between all articles (we can use sklearn.metrics.pairwise.cosine_similarity,
    # sklearn.metrics.pairwise.linear_kernel, or just the dot product of the matrix with itself)
    dense = tfidf_matrix.toarray()
    cosine_sim = dense @ dense.T
    # Get the index of the input article
    article_idx = df_content.index[df_content['article_id'] == article_id].tolist()
    if not article_idx:
        return []
    article_idx = article_idx[0]
    # Get similarity scores for the target article, drop the target article and sort them
    similarity_scores = list(enumerate(cosine_sim[article_idx].flatten()))
    similarity_scores = [(idx, score) for idx, score in similarity_scores if idx != article_idx]
    similarity_scores_sorted = sorted(similarity_scores, key=lambda article_score: article_score[1], reverse=True)
    # Extract the top_n most similar articles (the target itself was removed above)
    top_similar_articles_ids = [df_content.iloc[article_score[0]]['article_id'] for article_score in similarity_scores_sorted[:top_n]]
    # Map article ids to their names
    article_id_title_mapping = df_content[["article_id", "doc_full_name"]].drop_duplicates().set_index("article_id").to_dict()["doc_full_name"]
    top_similar_articles = list(map(lambda id: article_id_title_mapping.get(id), top_similar_articles_ids))
    return top_similar_articles
make_content_recs(article_id=4)
['Publish notebooks to GitHub in DSX', 'Tour the Community in DSX', 'Manage Object Storage in DSX', 'Collaborate on projects in DSX', 'Build SQL queries with Apache Spark in DSX', 'Load and analyze public data sets in DSX', 'Create a project in DSX', 'Load data into RStudio for analysis in DSX', 'Work with Data Connections in DSX', 'Sign up for a free trial in DSX']
2. Now that you have put together your content-based recommendation system, use the cell below to write a summary explaining how your content based recommender works. Do you see any possible improvements that could be made to your function? Is there anything novel about your content based recommender?
This part is NOT REQUIRED to pass this project. However, you may choose to take this on as an extra way to show off your skills.
My approach is based on what I learned in this course about content-based recommendations and on NLP from a previous lesson.
- I didn't distinguish between the individual text fields in df_content (doc_body, doc_description, doc_full_name) and merged them all together.
- Then I used the TF-IDF approach with TfidfVectorizer (which combines bag of words and the TF-IDF transformation) to vectorize and normalize the input text, limited to a maximum of 5000 features.
- Using the resulting matrix, I calculated the similarity between every pair of articles as the dot product of the matrix with its transpose. Because TfidfVectorizer L2-normalizes each row by default, this dot product is the cosine similarity.
- Then I looked up the articles most similar to the input article (by index, mapping back to the article ids and titles).
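The dot-product step described above only equals cosine similarity when every document vector has unit length. A minimal numpy-only sketch of that equivalence follows; `toy_weights` is a made-up weight matrix standing in for real TF-IDF output, not data from this notebook.

```python
import numpy as np

# Hypothetical term weights for 3 documents over 4 terms (not real TF-IDF output).
toy_weights = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],   # same direction as document 0
    [0.0, 0.0, 3.0, 1.0],   # shares no terms with documents 0 and 1
])

# L2-normalize each row so that every document vector has length 1.
row_norms = np.linalg.norm(toy_weights, axis=1, keepdims=True)
unit_rows = toy_weights / row_norms

# For unit-length rows, the dot product of the matrix with its transpose
# is the pairwise cosine similarity.
dot_sim = unit_rows @ unit_rows.T

# Cosine similarity computed directly from its definition, for comparison.
cos_sim = (toy_weights @ toy_weights.T) / (row_norms @ row_norms.T)

print(np.allclose(dot_sim, cos_sim))
```

Documents 0 and 1 point in the same direction, so their similarity is 1, while document 2 shares no terms with them and scores 0.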
3. Use your content-recommendation system to make recommendations for the below scenarios based on the comments. Again no tests are provided here, because there isn't one right answer that could be used to find these content based recommendations.
This part is NOT REQUIRED to pass this project. However, you may choose to take this on as an extra way to show off your skills.¶
# make recommendations for a brand new user
# Since we have no information about this user, the best option is to recommend the most popular articles:
display(get_top_articles(n=10))
# make recommendations for a user who has only interacted with article id '1027.0'
display(make_content_recs(article_id=1027))
['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']
['Importing Redis data into Compose Redis', 'Seven Databases in Seven Days – Day 7: Redis', 'A tour of the Redis stars', 'Redis Data Browser Now Available in Compose Dashboard', 'How to talk raw Redis', 'Getting Started with Compose and Bluemix', 'Mastering Redis high-availability and blocking connections', "A Quick Guide to Redis 3.2's Geo Support", 'Redis, Go, & How to Build a Chat Application', 'Enhanced Cloudant Search with Watson Alchemy']
Part V: Matrix Factorization
In this part of the notebook, you will use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform.
1. You should have already created a user_item matrix in question 1 of Part III above. This first question just requires that you run the cells to get things set up for the rest of Part V of the notebook.
# Load the matrix here
user_item = user_item.copy()
user_item_array = np.array(user_item)
# quick look at the matrix
user_item.head()
| article_id | 0.0 | 2.0 | 4.0 | 8.0 | 9.0 | 12.0 | 14.0 | 15.0 | 16.0 | 18.0 | ... | 1434.0 | 1435.0 | 1436.0 | 1437.0 | 1439.0 | 1440.0 | 1441.0 | 1442.0 | 1443.0 | 1444.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||||||||
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 714 columns
2. In this situation, you can use Singular Value Decomposition from numpy on the user-item matrix. Use the cell to perform SVD, and explain why this is different than in the lesson.
# Perform SVD on the User-Item Matrix
u, s, vt = np.linalg.svd(user_item)
u_new = u[:, :len(s)]
s_new = np.diag(s)
vt_new = vt[:len(s), :]
This sum should be 0 if the SVD worked properly.
np.sum(np.abs(user_item_array - ((u_new @ s_new) @ vt_new).round()))
np.float64(0.0)
We don't need FunkSVD and can use regular SVD because our matrix has no missing values. The user_item matrix is binary: it contains 1 where a user interacted with an article and 0 where they didn't, so every cell is defined. In the lesson, the matrix held ratings, and movies a user hadn't rated were missing rather than 0; regular SVD can't decompose a matrix with missing entries, which is why FunkSVD was needed there. We could have used regular SVD on the movies in the lesson too if we had stored only an indicator (1 = saw the movie, 0 = didn't see the movie) instead of ratings. What matters is whether cells are missing, not whether the values are binary: a user_item matrix built from interaction counts (0 if the user didn't interact with an article, 1 if they interacted once, 123 if they interacted 123 times) would still be fully defined, and we would only need FunkSVD here if non-interactions were treated as unknown rather than as 0.
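A minimal sketch of the decomposition step on a toy matrix may help here. The 5 x 3 matrix below is invented for illustration; the point is that every cell is defined, so `np.linalg.svd` runs and the trimmed factors reproduce the matrix exactly.

```python
import numpy as np

# Toy binary user-item matrix: 5 users x 3 articles, no missing values (illustrative only).
toy = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])

# With the default full_matrices=True, u is 5x5, s holds 3 singular values, vt is 3x3.
u, s, vt = np.linalg.svd(toy)

# Keep only the first len(s) columns of u so the three factors can be multiplied.
u_trim = u[:, :len(s)]
reconstructed = u_trim @ np.diag(s) @ vt

# Total absolute reconstruction error after rounding.
total_abs_error = np.abs(toy - reconstructed.round()).sum()
print(u.shape, s.shape, vt.shape, total_abs_error)
```

This mirrors the `u[:, :len(s)]` trimming used in the cell above: without it, the 5x5 `u` cannot be multiplied by the 3x3 diagonal matrix.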
3. Now for the tricky part, how do we choose the number of latent features to use? Running the below cell, you can see that as the number of latent features increases, we obtain a lower error rate on making predictions for the 1 and 0 values in the user-item matrix. Run the cell below to get an idea of how the accuracy improves as we increase the number of latent features.
num_latent_feats = np.arange(10, 700 + 10, 20)
sum_errs = []
for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_array, user_item_est)
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
plt.plot(num_latent_feats, 1 - np.array(sum_errs) / df.shape[0])
plt.xlabel("Number of Latent Features")
plt.ylabel("Accuracy")
plt.title("Accuracy vs. Number of Latent Features")
Text(0.5, 1.0, 'Accuracy vs. Number of Latent Features')
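The upward trend in this plot is expected by construction: on the same matrix that was decomposed, the unrounded (Frobenius) error of a rank-k truncation can never increase as k grows. A small sketch on an invented random binary matrix (the shape, density and seed are arbitrary assumptions) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 30 x 12 binary interaction matrix with roughly 20% ones.
toy = (rng.random((30, 12)) < 0.2).astype(float)

u, s, vt = np.linalg.svd(toy, full_matrices=False)

frob_errors = []
for k in range(1, len(s) + 1):
    # Rank-k approximation built from the k largest singular values.
    approx = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    frob_errors.append(np.linalg.norm(toy - approx))

print([round(e, 3) for e in frob_errors])
```

Each error equals the square root of the sum of the squared discarded singular values, so it shrinks to zero once all of them are kept. That says nothing about how well the model predicts unseen interactions.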
4. From the above, we can't really be sure how many features to use, because simply having a better way to predict the 1's and 0's of the matrix doesn't exactly give us an indication of if we are able to make good recommendations. Instead, we might split our dataset into a training and test set of data, as shown in the cell below.
Use the code from question 3 to understand the impact on accuracy of the training and test sets of data with different numbers of latent features. Using the split below:
- How many users can we make predictions for in the test set?
- How many users are we not able to make predictions for because of the cold start problem?
- How many articles can we make predictions for in the test set?
- How many articles are we not able to make predictions for because of the cold start problem?
df_train = df.head(40000)
df_test = df.tail(5993)
def create_test_and_train_user_item(df_train: pd.DataFrame, df_test: pd.DataFrame):
    """
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe
                     (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    train_idx - all of the train user ids
    train_arts - all of the train article ids
    """
    user_item_train = df_train.pivot_table(index="user_id", columns='article_id', aggfunc='size', fill_value=0)
    user_item_train[user_item_train > 0] = 1
    user_item_test = df_test.pivot_table(index="user_id", columns='article_id', aggfunc='size', fill_value=0)
    user_item_test[user_item_test > 0] = 1
    test_idx = df_test['user_id'].unique()
    test_arts = df_test['article_id'].unique()
    train_idx = df_train['user_id'].unique()
    train_arts = df_train['article_id'].unique()
    return user_item_train, user_item_test, test_idx, test_arts, train_idx, train_arts
user_item_train, user_item_test, test_idx, test_arts, train_idx, train_arts = create_test_and_train_user_item(
    df_train, df_test
)
len(test_idx)
682
len(train_idx)
4487
set(train_idx) & set(test_idx)
{np.int64(2917),
np.int64(3024),
np.int64(3093),
np.int64(3193),
np.int64(3527),
np.int64(3532),
np.int64(3684),
np.int64(3740),
np.int64(3777),
np.int64(3801),
np.int64(3968),
np.int64(3989),
np.int64(3990),
np.int64(3998),
np.int64(4002),
np.int64(4204),
np.int64(4231),
np.int64(4274),
np.int64(4293),
np.int64(4487)}
len(set(train_idx) & set(test_idx))
20
len(test_arts)
574
len(train_arts)
714
len(set(train_arts) & set(test_arts))
574
user_item_test.shape
(682, 574)
# Replace the values in the dictionary below
a = 662
b = 574
c = 20
d = 0
sol_4_dict = {
'How many users can we make predictions for in the test set?': c,
'How many users in the test set are we not able to make predictions for because of the cold start problem?': a,
'How many articles can we make predictions for in the test set?': b,
'How many articles in the test set are we not able to make predictions for because of the cold start problem?': d
}
t.sol_4_test(sol_4_dict)
Awesome job! That's right! All of the test articles are in the training data, but there are only 20 test users that were also in the training set. All of the other users that are in the test set we have no data on. Therefore, we cannot make predictions for these users using SVD.
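The four answers come down to set intersections and differences between the train and test ids. A minimal sketch with invented ids (none of these values come from the dataset):

```python
# Hypothetical user and article ids, chosen only for illustration.
train_users = {1, 2, 3, 4}
test_users = {3, 4, 5, 6, 7}
train_articles = {10.0, 20.0, 30.0}
test_articles = {10.0, 30.0}

# Test users that also appear in training: SVD has latent factors for them.
predictable_users = test_users & train_users
# Test users never seen in training: the cold start problem.
cold_start_users = test_users - train_users

# The same logic applies to articles.
predictable_articles = test_articles & train_articles
cold_start_articles = test_articles - train_articles

print(len(predictable_users), len(cold_start_users),
      len(predictable_articles), len(cold_start_articles))
```

Applied to the notebook's own variables, `len(set(test_idx) - set(train_idx))` gives the 662 cold start users (682 test users minus the 20 shared ones).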
5. Now use the user_item_train dataset from above to find U, S, and V transpose using SVD. Then find the subset of rows in the user_item_test dataset that you can predict using this matrix decomposition with different numbers of latent features to see how many features makes sense to keep based on the accuracy on the test data. This will require combining what was done in questions 2 - 4.
Use the cells below to explore how well SVD works towards making predictions for recommendations on the test data.
Use these cells to see how well you can use the training decomposition to predict on test data
def get_common_users_and_articles(user_item_train: pd.DataFrame, user_item_test: pd.DataFrame):
    common_users = np.intersect1d(user_item_train.index, user_item_test.index)
    common_articles = np.intersect1d(user_item_train.columns, user_item_test.columns)
    return common_users, common_articles
def svd_decomposition(user_item_matrix: pd.DataFrame, k: int):
    u, s, vt = np.linalg.svd(user_item_matrix, full_matrices=False)
    u_k = u[:, :k]
    # truncate sigma to k values as well (full_matrices=False trims u and vt to len(s), not to k)
    s_k = np.diag(s[:k])
    vt_k = vt[:k, :]
    return u_k, s_k, vt_k
def predict_svd(u_k: np.ndarray, s_k: np.ndarray, vt_k: np.ndarray):
    return (u_k @ s_k) @ vt_k
def evaluate_metrics(predictions, user_item_test, common_users, common_articles):
    # predictions follow the row/column order of the global user_item_train,
    # so the positions of the common users and articles are looked up there
    common_user_idx = [np.where(user_item_train.index == user)[0][0] for user in common_users]
    common_article_idx = [np.where(user_item_train.columns == article)[0][0] for article in common_articles]
    test_subset = user_item_test.loc[common_users, common_articles]
    pred_subset = predictions[np.ix_(common_user_idx, common_article_idx)]
    tp = np.sum((pred_subset > 0.5) & (np.array(test_subset) == 1))
    tn = np.sum((pred_subset <= 0.5) & (np.array(test_subset) == 0))
    p = np.sum(np.array(test_subset) == 1)
    n = np.sum(np.array(test_subset) == 0)
    rec = tp / p if p != 0 else 0
    # specificity divides by the number of negatives, so guard on n (not p)
    spec = tn / n if n != 0 else 0
    acc = (tp + tn) / (p + n) if p + n != 0 else 0
    return rec, acc, spec
Test the above functions on the train data itself: all the metrics should be 1 (predictions identical to the original data)
common_users, common_articles = get_common_users_and_articles(user_item_train, user_item_test)
k = user_item_train.shape[1]
u_k, s_k, vt_k = svd_decomposition(user_item_train, k)
predictions = predict_svd(u_k, s_k, vt_k).round()
rec, acc, spec = evaluate_metrics(predictions=predictions, user_item_test=user_item_train, common_users=common_users, common_articles=common_articles)
print(f"Recall: {rec}\nAccuracy: {acc}\nSpecificity: {spec}")
Recall: 1.0 Accuracy: 1.0 Specificity: 1.0
Apply the method to the test data
common_users, common_articles = get_common_users_and_articles(user_item_train, user_item_test)
k_values = np.arange(10, 700 + 10, 20)
accuracy = []
recall = []
specificity = []
for k in k_values:
    # Get the SVD decomposition with k features
    u_k, s_k, vt_k = svd_decomposition(user_item_train, k)
    # Get the predictions for the full user-item matrix
    predictions = predict_svd(u_k, s_k, vt_k).round()
    # Evaluate recall, accuracy and specificity on the common users and articles subset
    rec, acc, spec = evaluate_metrics(predictions=predictions, user_item_test=user_item_test, common_users=common_users, common_articles=common_articles)
    recall.append(rec)
    specificity.append(spec)
    accuracy.append(acc)
plt.figure()
plt.plot(k_values, np.array(recall))
plt.xlabel("Number of Latent Features")
plt.ylabel("Metric")
plt.title("Recall")
plt.show()
plt.figure()
plt.plot(k_values, np.array(accuracy))
plt.xlabel("Number of Latent Features")
plt.ylabel("Metric")
plt.title("Accuracy")
plt.show()
plt.figure()
plt.plot(k_values, np.array(specificity))
plt.xlabel("Number of Latent Features")
plt.ylabel("Metric")
plt.title("Specificity")
plt.show()
6. Use the cell below to comment on the results you found in the previous question. Given the circumstances of your results, discuss what you might do to determine if the recommendations you make with any of the above recommendation systems are an improvement to how users currently find articles?
The explanation of the graphs:
Accuracy goes hand in hand with specificity in our case. That is because the test subset contains far more negatives (zeros) than positives (ones): 11262 versus 218. As a result, even the simplest model, the one with the fewest latent features, produces the "best" accuracy and specificity, since both metrics are dominated by the true negatives. MSE gives similar results, but I didn't plot it because I find accuracy more suitable here. Based on these two metrics alone, it wouldn't make sense to increase the number of latent features.
Recall, however, tells a different story: it increases with the number of latent features (up to around 200). Recall looks only at the positives, so it shows that adding latent features does improve how many actual interactions the model recovers.
A model with a higher number of latent features (around 200, as mentioned) would therefore give more correct recommendations, but it would also increase the number of recommendations outside the user's interest. For a better model, we would need more users common to the train and test data, and those users would need to be more consistent in the articles they interact with: one possible reason this model doesn't perform well is that users interact with somewhat different articles in the test data than in the train data. Another way to address the cold start problem would be to combine content-based recommendations with SVD and user-user based collaborative filtering.
test_subset = user_item_test.loc[common_users, common_articles]
print(f"Positives: {np.sum(np.array(test_subset) == 1)}")
print(f"Negatives: {np.sum(np.array(test_subset) == 0)}")
Positives: 218 Negatives: 11262
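To put these class counts in perspective, here is a small sketch of a baseline that never recommends anything, i.e. predicts 0 for every cell. The baseline is a hypothetical comparison point, not one of the models evaluated above; only the two counts are taken from the output.

```python
# Class counts reported for the common-user test subset.
positives = 218
negatives = 11262

# A baseline that predicts 0 everywhere: no true positives, every negative correct.
tp, tn = 0, negatives

baseline_accuracy = (tp + tn) / (positives + negatives)
baseline_recall = tp / positives
baseline_specificity = tn / negatives

print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.1f}, specificity={baseline_specificity:.1f}")
```

This trivial baseline already reaches about 98.1% accuracy and perfect specificity while recovering none of the 218 real interactions, which is why recall is the more informative metric here.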